Performance and Programming Experience on the Tera MTA

Authors

  • Larry Carter
  • John Feo
  • Allan Snavely
Abstract

The Tera MTA (for "Multithreaded Architecture") computer features a radically new architecture, with hardware support for up to 128 threads per processor, a powerful instruction set, nearly uniform access time to all memory locations, and zero-cost synchronization and swapping between threads of control. Memory access latencies are tolerated by swapping between the threads. Given a multithreaded program with sufficient parallelism, the scalable memory system should allow uncommonly good scaling to multiple processors. This paper gives a brief description of the MTA's architecture and a few observations about its programmability, and then presents some performance figures.

1 The Tera MTA

Each processor of the Tera MTA has 128 streams, where a stream is hardware that includes a program counter and a set of 32 registers. Each stream can be assigned to (at most) one program thread.¹ A stream can issue one instruction, but then must wait at least 21 cycles (the length of the instruction pipeline) before issuing another. However, instructions from different streams on the same processor can be pipelined. That is, each cycle the processor selects (fairly) one of the streams that is ready, and issues the next instruction for the thread assigned to that stream. If there are no ready streams, it issues a no-op (called a phantom). The Tera MTA uses a 128-bit VLIW instruction architecture, where each instruction can include one memory operation (either a Load or a Store) plus two other operations.

The MTA has no data caches; instead, all memory references proceed through a 3-D toroidal network to the appropriate memory module and back to the issuing processor. This roundtrip might require 150 or even more cycles. The network and memory system are designed to sustain a throughput of one memory reference per processor per cycle. Given the relatively long latency for each memory request, there will be many memory references "in flight" at any instant of time.
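The issue discipline described above (each cycle, fairly pick one ready stream or issue a phantom, with a 21-cycle wait before a stream can issue again) can be sketched with a toy simulation. This is a hypothetical model for illustration only, not the MTA's actual scheduler; the round-robin selection policy and the `simulate` function are assumptions.

```python
# Toy model of interleaved multithreading on one MTA-style processor:
# each cycle the processor picks one ready stream (round-robin, for
# fairness); a stream that issues must wait PIPELINE_DEPTH cycles
# before it is ready again.  Too few streams => phantoms (no-ops).

PIPELINE_DEPTH = 21  # instruction pipeline length, per the text

def simulate(num_streams, cycles):
    """Return (instructions issued, phantoms issued) over `cycles`."""
    ready_at = [0] * num_streams   # cycle at which each stream is next ready
    issued = phantoms = 0
    rr = 0                         # round-robin pointer for fair selection
    for cycle in range(cycles):
        for k in range(num_streams):
            s = (rr + k) % num_streams
            if ready_at[s] <= cycle:
                ready_at[s] = cycle + PIPELINE_DEPTH
                issued += 1
                rr = s + 1
                break
        else:
            phantoms += 1          # no ready stream: phantom this cycle
    return issued, phantoms

print(simulate(21, 2100))  # 21 streams fill every cycle: (2100, 0)
print(simulate(7, 2100))   # 7 streams fill a third of cycles: (700, 1400)
```

With at least 21 streams, every pipeline slot is covered and no phantoms appear; with fewer, utilization drops in proportion, which is why the MTA relies on abundant threads.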
Each stream is allowed to have up to eight outstanding memory references. If an application has enough instruction-level parallelism that no memory reference is needed until eight instructions later, then just 21 streams, each executing an independent thread of instructions, provide the ability to tolerate 21 × 8 = 168 cycles of memory latency. When there is less ILP available, or the threads interact, more threads may be necessary to keep the processor busy.

¹ A thread is a sequence of instructions. Threads are like …
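The 21 × 8 = 168 arithmetic above can be turned around: to keep one memory reference in flight per cycle when each stream may have at most eight outstanding references, the number of streams needed grows with the memory latency. The `streams_needed` helper below is an assumed name for this back-of-the-envelope calculation, not an API from the paper.

```python
import math

MAX_OUTSTANDING_REFS = 8   # per-stream limit stated in the text

def streams_needed(latency_cycles):
    """Streams required to sustain one memory reference per cycle when
    each stream may have at most MAX_OUTSTANDING_REFS in flight."""
    return math.ceil(latency_cycles / MAX_OUTSTANDING_REFS)

print(streams_needed(168))  # 21 streams cover 21 * 8 = 168 cycles
print(streams_needed(150))  # 19 streams suffice for a 150-cycle roundtrip
```

This matches the text's figures: 21 streams tolerate 168 cycles, comfortably above the ~150-cycle network roundtrip, provided each stream really sustains eight outstanding references.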


Similar articles

Loop Parallelism on Tera MTA Using Sisal

The difficulty of programming parallel computers has impeded their widespread use. The problems are caused by existing hardware and software tools. The software problems on shared-memory and vector computers can be solved by using deterministic high-performance functional languages like SISAL. Distributed-memory computers have even more obstacles than shared-memory parallel machines. Research ...


Scheduling on the Tera MTA

This paper describes the scheduling issues specific to the Tera MTA high performance shared memory multithreaded multiprocessor and presents solutions to classic scheduling problems. The Tera MTA exploits parallelism at all levels, from fine-grained instruction-level parallelism within a single processor to parallel programming across processors, to multiprogramming among several applications simu...


Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural

The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG...



Symbiotic Jobscheduling on the Tera MTA

Symbiosis is a term from biology meaning the living together of dissimilar organisms in close proximity. We adapt that term to refer to an increase in throughput that can occur when jobs are coscheduled on multithreaded machines. On a multithreaded machine such as the Tera MTA (Multithreaded Architecture), coscheduled jobs share system resources very intimately, on a cycle-by-cycle basis. This can...




Publication date: 1999